ABSTRACT
This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that address different topics. The method combines the Probabilistic Latent Semantic Analysis (PLSA) model with a procedure that selects segmentation points from the similarity values between pairs of adjacent blocks. PLSA yields a better representation of the sparse information in a short text block, such as a sentence or a short sequence of sentences. Segmentation performance is further improved by combining different instantiations of the same model, obtained either from different random initializations or from different numbers of latent classes. Results on commonly used data sets are significantly better than those of other state-of-the-art systems.
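The pipeline described above can be sketched in a few lines: fit a PLSA model by EM on a block-by-word count matrix, represent each block by its topic distribution P(z|d), and hypothesize a boundary wherever the similarity between adjacent blocks dips. This is a minimal illustration, not the paper's implementation: the EM routine, the cosine similarity measure, and the fixed `threshold` for boundary selection are simplifying assumptions (the paper selects segmentation points from the similarity profile itself).

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a PLSA model by EM on a block-by-word count matrix.
    Returns P(topic|block) and P(word|topic)."""
    rng = np.random.default_rng(seed)
    n_blocks, n_words = counts.shape
    p_z_d = rng.random((n_blocks, n_topics))       # P(z|d), random init
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))        # P(w|z), random init
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (d, z, w)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts
        weighted = counts[:, None, :] * joint
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def boundaries(p_z_d, threshold=0.5):
    """Propose a boundary after block i when the cosine similarity
    of adjacent topic vectors P(z|d) falls below the threshold."""
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(p_z_d[:-1], p_z_d[1:])
    ]
    return [i for i, s in enumerate(sims) if s < threshold], sims
```

Combining instantiations, as the abstract proposes, would amount to averaging the similarity profile `sims` over several runs of `plsa` with different seeds or topic counts before selecting boundaries.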
Topic-based document segmentation with probabilistic latent semantic analysis